DSCI 100 Group 17 Report: Classifying Celestial Bodies from Spectral Characteristics¶

Group members:

  • Aidan Wong
  • Ben Tyler
  • Tyson Quan

Introduction¶

Stars are large spheres of hot gas that emit heat and light into space. They are composed mostly of hydrogen, with some helium and other elements. The sun is an example of a star and is the closest star to Earth (NASA, n.d.b).

Galaxies are collections of planets, stars, gases, and dust that are held together by gravity. Galaxies are very large and emit light from the stars and other material they contain. The Milky Way Galaxy, where the Earth is located, is an example of a galaxy (NASA, n.d.a).

Quasars are the cores of active galaxies, and they are powered by supermassive black holes. They emit immense amounts of heat and light due to the friction of material being drawn in. The closest quasar to Earth, called 3C 273, can be seen with an 8-inch telescope (Cooper, 2018).

The classification of celestial objects into stars, galaxies, and quasars has been pivotal to understanding Earth's place within space. It has led to key insights such as the discovery that the Andromeda galaxy is separate from our own, and this classification continues to be essential for astronomical research (Clarke et al., 2020).

In this report, we will use data on celestial objects to answer the following question: "Based on its redshift and brightness in different wavelengths of light, what type of celestial object is this?"

Our data set is from Sloan Digital Sky Survey Data Release 16. It was collected by the Sloan Digital Sky Survey telescope, which measures the spectral characteristics of light (Fukugita et al., 1996). It contains data on 100,000 astronomical objects, divided into three classes: galaxies, stars, and quasars. The data include redshift, which reflects how quickly an object is moving (Fedesoriano, 2022), and brightness in five wavelength bands: ultraviolet, green, red, near infrared, and infrared. We will focus on these six variables to predict the class of astronomical objects.

Import Libraries¶

In [ ]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import make_column_selector
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Jupyter Notebook settings for printing and plotting graphs
set_config(transform_output="pandas")
alt.data_transformers.disable_max_rows()

# Seed to ensure reproducible report
np.random.seed(1234)

Methods and Results¶

This section consists of three main parts:

  1. Loading and Cleaning Data
  2. Exploratory Data Analysis
  3. Classification Analysis

Methods¶

In this section, we explain the methods we used to reach our findings.

First, we use six variables represented as columns in the data set: u, g, r, i, z, and redshift. The first five are brightness values in different bands of light: ultraviolet, green, red, near-infrared, and infrared (Fukugita et al., 1996). They are measured in magnitude, which is unitless and reflects photon abundance (SDSS Voyages, 2024a). These magnitudes could help determine object class because quasars, galaxies, and stars can have distinctive colours (SDSS, n.d.a). We also include redshift, which indicates the lengthening of an object's light wavelengths due to the expansion of the universe (SDSS, 2024b). Galaxies and quasars often have higher redshift values than stars, so a high redshift can point toward those classes (Crockett, 2021).
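Because class separation in magnitude space comes largely from colour differences, the idea can be illustrated with magnitude differences ("colour indices"). Below is a minimal sketch: the column names match our renamed data set, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows with SDSS-style magnitudes after our renaming.
# Smaller magnitude means a brighter object.
sample = pd.DataFrame({
    "Ultraviolet": [18.7, 18.5],
    "Green": [17.1, 17.3],
    "Red": [16.6, 17.2],
})

# A colour index is the difference between magnitudes in two bands;
# e.g. Ultraviolet - Green (u - g) tends to be small for blue objects.
sample["u_minus_g"] = sample["Ultraviolet"] - sample["Green"]
sample["g_minus_r"] = sample["Green"] - sample["Red"]
print(sample[["u_minus_g", "g_minus_r"]])
```

Colour indices like these are one reason the magnitude columns carry class information that a distance-based classifier can exploit.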

After gathering the necessary data, we proceeded with data preprocessing, which involved filtering and renaming the columns to ensure comprehensibility and ease of use. Once the dataset was cleaned and prepared, we conducted an exploratory analysis to gain a thorough understanding of the data. Initially, we examined the data types of each column and assessed the distribution of classes. Since we planned to perform K-Nearest Neighbor (KNN) classification, achieving a balanced distribution of classes was crucial for accurate results. To identify suitable variables for classification, we visualized the data using density plots, which helped us analyze the distinct characteristics exhibited by each variable.

After completing the exploratory analysis, we proceeded with the classification analysis using the six selected variables. As the class distribution in the original dataset was unbalanced, we performed upsampling to create a balanced dataset. Subsequently, we followed the standard procedure for KNN classification. This involved splitting the dataset into training and testing sets. We also created a pipeline containing a KNN model object and a preprocessor to standardize the numerical variables. To determine the optimal parameter k, we conducted 5-fold cross-validation on the training data set, which used the pipeline and tested k values from 2 to 14.

To visualize the results of the cross-validation, we created a plot of k values against estimated accuracy, which aided in selecting the appropriate k value. Next, we evaluated the performance of our classification model using scoring functions and cross-tabulation analysis to gain a comprehensive understanding of the model's results. Additionally, we created a pairplot to explore the relationships between each parameter used in the classification model. Based on these findings, we repeated the same procedure for a new set of chosen variables.

Upon completing both models, we reached conclusions based on our findings, which are presented in the following section.

1. Loading and Cleaning Data¶

In [ ]:
# Load in the data file from the web (Pandas, 2019).
url="https://drive.google.com/file/d/1LM-kB1xP90O9RBY5yjRP1mET_BKOOhxC/view?usp=sharing"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
star_data = pd.read_csv(url)

star_data.head()
Out[ ]:
objid ra dec u g r i z run rerun camcol field specobjid class redshift plate mjd fiberid
0 1237666301628060000 47.372545 0.820621 18.69254 17.13867 16.55555 16.34662 16.17639 4849 301 5 771 8168632633242440000 STAR 0.000115 7255 56597 832
1 1237673706652430000 116.303083 42.455980 18.47633 17.30546 17.24116 17.32780 17.37114 6573 301 6 220 9333948945297330000 STAR -0.000093 8290 57364 868
2 1237671126974140000 172.756623 -8.785698 16.47714 15.31072 15.55971 15.72207 15.82471 5973 301 1 13 3221211255238850000 STAR 0.000165 2861 54583 42
3 1237665441518260000 201.224207 28.771290 18.63561 16.88346 16.09825 15.70987 15.43491 4649 301 3 121 2254061292459420000 GALAXY 0.058155 2002 53471 35
4 1237665441522840000 212.817222 26.625225 18.88325 17.87948 17.47037 17.17441 17.05235 4649 301 3 191 2390305906828010000 GALAXY 0.072210 2123 53793 74
In [ ]:
# Cleaning data
# Filter relevant columns and rename columns for a more comprehensible understanding
star_filtered = (
    star_data.loc[:, ["u", "g", "r", "i", "z", "redshift", "class"]]
    .rename(columns={
        "u":"Ultraviolet", 
        "g":"Green", 
        "r":"Red", 
        "i":"Near Infrared", 
        "z":"Infrared",
        "redshift":"Redshift",
        "class":"Class"
    })
)
star_filtered.head()
Out[ ]:
Ultraviolet Green Red Near Infrared Infrared Redshift Class
0 18.69254 17.13867 16.55555 16.34662 16.17639 0.000115 STAR
1 18.47633 17.30546 17.24116 17.32780 17.37114 -0.000093 STAR
2 16.47714 15.31072 15.55971 15.72207 15.82471 0.000165 STAR
3 18.63561 16.88346 16.09825 15.70987 15.43491 0.058155 GALAXY
4 18.88325 17.87948 17.47037 17.17441 17.05235 0.072210 GALAXY

2. Exploratory Data Analysis¶

This section provides a summary of the data set relevant to the exploratory data analysis for our planned classification.

In [ ]:
# General understanding of the dataset
star_filtered.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Ultraviolet    100000 non-null  float64
 1   Green          100000 non-null  float64
 2   Red            100000 non-null  float64
 3   Near Infrared  100000 non-null  float64
 4   Infrared       100000 non-null  float64
 5   Redshift       100000 non-null  float64
 6   Class          100000 non-null  object 
dtypes: float64(6), object(1)
memory usage: 5.3+ MB
In [ ]:
# Understand the proportion of classes in the dataset to determine whether we have to upsample the data or not
star_filtered["Class"].value_counts(normalize=True)
Out[ ]:
Class
GALAXY    0.51323
STAR      0.38096
QSO       0.10581
Name: proportion, dtype: float64

From the above information, we are able to understand the data types of our dataset and the proportion of the classes.

With this information, we can conclude that we should upsample the data set to have a fair classification of celestial bodies.

This section creates visualizations of the data set relevant to the exploratory data analysis for our planned classification.

In [ ]:
# Standardizing the data for plotting in the below sections
preprocessor_keep_all = make_column_transformer(
    (StandardScaler(), ['Ultraviolet', 'Green', 'Red', 'Near Infrared', 'Infrared', "Redshift"]),
    remainder="passthrough",
    verbose_feature_names_out=False
)

# Use fit to compute all the necessary values to scale the data
preprocessor_keep_all.fit(star_filtered)

# Transform function to apply the standardization
star_scaled = preprocessor_keep_all.transform(star_filtered)

star_scaled.head()

#star_scaled.nlargest(5, "Redshift")
Out[ ]:
Ultraviolet Green Red Near Infrared Infrared Redshift Class
0 0.065633 -0.272293 -0.287759 -0.230598 -0.226791 -0.389669 STAR
1 -0.194147 -0.103121 0.317192 0.580613 0.705310 -0.390143 STAR
2 -2.596213 -2.126356 -1.166443 -0.746957 -0.501159 -0.389555 STAR
3 -0.002769 -0.531149 -0.691260 -0.757044 -0.805267 -0.257025 GALAXY
4 0.294775 0.479099 0.519436 0.453794 0.456601 -0.224904 GALAXY

In the code below, we plot density plots, as density plots are effective for comparing multiple distributions.

With these density distributions, we would like to identify any variables that exhibit different distributions between the classes (e.g., star, galaxy, or quasar).

In [ ]:
# Plotting the distribution of different characteristics values based on their class.
star_exploration_plot = alt.Chart(
    star_scaled.melt(
        id_vars=["Class"],
        var_name="Characteristics",
        value_name="Values",
    )
).transform_density(
    "Values",
    groupby=["Class", "Characteristics"],
    as_=["Values", "Density"]
).mark_area(opacity=0.6).encode(
    x=alt.X("Values").scale(base=10),
    y=alt.Y("Density:Q", title="Density"),
    color="Class:N"
).properties(
    width=150,
    height=150
).facet(
    alt.Facet(
        "Characteristics",
        sort=star_scaled.columns[:-1].tolist()
    ),
    columns=6
).resolve_scale(
    # We set the scales to "independent" since the variables are standardized,
    # so their original ranges do not need to be shared across facets
    x="independent",
    y="independent"
)

star_exploration_plot 
Out[ ]:

Figure 1.1 For each variable, we plot a curve depicting the density of the standardized values corresponding to each class, where QSO represents quasar.

From the diagram above, we can see differences among the distributions. The redshift distribution is hard to read at this scale (and may render poorly depending on the viewing environment), so we take a closer look at it below.

In [ ]:
# The redshift density is difficult to see, so we subsample 1,000 rows and remove standardized values greater than 4
transform_star = star_scaled.sample(n=1000).melt( 
        id_vars=["Class"],
        var_name="Characteristics",
        value_name="Values",
)

star_redshift = transform_star[(transform_star["Characteristics"] == "Redshift") & (transform_star["Values"] < 4)]
star_redshift
Out[ ]:
Class Characteristics Values
5000 GALAXY Redshift -0.370767
5001 STAR Redshift -0.391511
5002 GALAXY Redshift -0.194813
5003 GALAXY Redshift -0.320388
5004 GALAXY Redshift -0.122430
... ... ... ...
5995 STAR Redshift -0.390189
5996 GALAXY Redshift -0.285476
5997 GALAXY Redshift -0.013136
5998 GALAXY Redshift -0.219365
5999 STAR Redshift -0.389177

981 rows × 3 columns

In [ ]:
# Plotting the distribution of different characteristics values based on their class.
star_exploration_plot_2 = alt.Chart(
    star_redshift
).transform_density(
    "Values",
    groupby=["Class", "Characteristics"],
    as_=["Values", "Density"]
).mark_area(opacity=0.6).encode(
    x=alt.X("Values").scale(base=10),
    # We use a square-root scale on the y-axis to compress the large density values
    y=alt.Y("Density:Q", title="Density").scale(type="sqrt"),
    color="Class:N"
).properties(
    width=1000,
    height=150
).facet(
    alt.Facet(
        "Characteristics",
        sort=star_scaled.columns[:-1].tolist()
    ),
    columns=6
).resolve_scale(
    # We set the scales to "independent" since the variables are standardized,
    # so their original ranges do not need to be shared across facets
    x="independent",
    y="independent"
)

star_exploration_plot_2 
Out[ ]:

Figure 1.2 For the redshift variable, we plot a curve depicting the density of the standardized values corresponding to each class, where QSO represents quasar.

As shown in Figures 1.1 and 1.2, the classes tend to exhibit different characteristics across the selected variables, as their distributions within each plot generally differ from one another.

Hence, we can perform KNN classification using these variables.

Exploration Analysis Conclusion¶

In conclusion, we should upsample our dataset as the proportion of classes is unbalanced. We are able to conduct classification on the following variables: Ultraviolet, Green, Red, Near Infrared, Infrared and Redshift as they exhibit different distributions among classes.

3. Classification Analysis¶

In [ ]:
# Splitting the data into training and testing data. We added stratify to ensure the classes of objects are distributed evenly in testing and training data.
star_train, star_test = train_test_split(
    star_filtered, train_size=0.75, stratify=star_filtered["Class"]
)

# We upsample the star and quasar classes to train our model on training data with equal proportions of classes, as quasars and stars are underrepresented.
QSO_train = star_train[star_train["Class"] == "QSO"]
STAR_train = star_train[star_train["Class"] == "STAR"]
GALAXY_train = star_train[star_train["Class"] == "GALAXY"]
QSO_upsample = QSO_train.sample(
    n=GALAXY_train.shape[0], replace=True
)
STAR_upsample = STAR_train.sample(
    n=GALAXY_train.shape[0], replace=True
)
upsampled_star_train = pd.concat((QSO_upsample, STAR_upsample, GALAXY_train))
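As a sanity check that resampling with replacement up to the majority-class count balances the classes, here is a minimal self-contained sketch using a toy frame in place of star_train (all values are made up):

```python
import pandas as pd

# Toy imbalanced frame standing in for star_train (hypothetical values).
toy = pd.DataFrame({
    "Redshift": [0.1] * 6 + [1.5] * 2 + [0.0] * 3,
    "Class": ["GALAXY"] * 6 + ["QSO"] * 2 + ["STAR"] * 3,
})

majority_n = (toy["Class"] == "GALAXY").sum()

# Same idea as the cell above: resample each class with replacement
# up to the majority-class count, then recombine.
balanced = pd.concat([
    toy[toy["Class"] == cls].sample(n=majority_n, replace=True, random_state=1234)
    for cls in ["GALAXY", "QSO", "STAR"]
])

print(balanced["Class"].value_counts())  # every class now has majority_n rows
```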
In [ ]:
# We create our K-neighbours classifier object and preprocessor to standardize the training data. 
star_knn_1 = KNeighborsClassifier()

star_preprocessor = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"])
)

star_pipeline = make_pipeline(star_preprocessor, star_knn_1)

# We use GridSearchCV to tune our model to estimate the k value with the most accuracy.
parameter_grid ={
    "kneighborsclassifier__n_neighbors" : range(2,15,1),
}

star_tune = GridSearchCV(
    star_pipeline,
    parameter_grid,
    cv=5,
    return_train_score=True,
    n_jobs=-1
)

# We create an object with the tuned model fitted to the training data to display the model's accuracy.
star_model = star_tune.fit(upsampled_star_train[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]], upsampled_star_train["Class"])

star_accuracy = pd.DataFrame(star_model.cv_results_)
In [ ]:
# We create a plot of k values against estimated accuracy to choose our k value.
accuracy_plot = alt.Chart(star_accuracy, title = "Accuracy estimates by nearest neighbours").mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Accuracy estimate").scale(zero=False)
)
accuracy_plot
# Our accuracy plot shows that we should use k = 2 in our classification model, as it is estimated to provide the highest accuracy.
Out[ ]:

Figure 2. Accuracy estimates from the model tuned on the training data, which are plotted against the number of neighbours in the KNN classification model.

In Figure 2, k = 2 is estimated to have the highest accuracy, which will therefore be used to classify the test data.
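The selected k can also be read directly from the fitted GridSearchCV object via its best_params_ attribute, rather than off the plot alone. Below is a self-contained sketch in which toy, well-separated two-class data stands in for our upsampled training set:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)
# Toy two-class data standing in for the real training set.
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.array(["STAR"] * 50 + ["GALAXY"] * 50)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": range(2, 15)}, cv=5)
grid.fit(X, y)

# best_params_ holds the k with the highest mean cross-validation accuracy,
# and best_score_ holds that accuracy estimate.
print(grid.best_params_)
print(grid.best_score_)
```

On our real pipeline, the same attributes are available on star_tune after fitting.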

In [ ]:
# We use the tuned model to predict the class of each testing data observation in a new column in the testing data frame called "Prediction."
# The best neighbours value found two cells prior is stored in the star_tune object.
star_test["Prediction"] = star_tune.predict(
    star_test[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]]
)
In [ ]:
# This cell outputs the accuracy of our model on the test data (the number of correct predictions divided by the total number of predictions), which is 96.172%.
star_tune.score(
    star_test[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]],
    star_test["Class"]
)
Out[ ]:
0.96172
In [ ]:
# This cell outputs a confusion matrix for the test data, with each row representing the true class value, and each column representing the classes our model predicted.
pd.crosstab(star_test["Class"], star_test["Prediction"])
Out[ ]:
Prediction GALAXY QSO STAR
Class
GALAXY 12451 75 305
QSO 124 2521 0
STAR 440 13 9071
In [ ]:
# This cell outputs a pair plot, which plots each variable against red brightness for a subsample of the data.
# This code was adapted from the Regression 2 Tutorial.
columns_to_plot = ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]

star_train_sample = star_train.sample(n=1000)

pairplot = alt.Chart(star_train_sample).mark_point().encode(
    alt.X(alt.repeat("row"), type="quantitative").scale(zero=False),
    alt.Y(alt.repeat("column"), type="quantitative").scale(zero=False),
).properties(
    width=200,
    height=200
).repeat(
    column=columns_to_plot,
    row=["Red"]
)
pairplot
Out[ ]:

Figure 3. All variables plotted against red brightness magnitude. A random subset of 1,000 observations from the data set are plotted to ensure computational efficiency. Red brightness magnitude appears most strongly correlated with green, near infrared, and infrared brightness magnitudes, and more weakly correlated with ultraviolet brightness magnitude and redshift.

Given the results of the previous plot, we repeat our classification procedure below using only the variables red, ultraviolet, and redshift, since the eliminated variables (green, near infrared, and infrared) are strongly correlated with red.
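The visual correlation judgment can also be backed with numbers. On the real data this would be star_train[columns_to_plot].corr(); here is a self-contained sketch with a toy frame built to mimic the pattern we observed (three bands tracking red closely, a noisier ultraviolet, and an unrelated redshift):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
red = rng.normal(17, 1, 500)

# Toy frame mimicking the structure of our predictors (values are synthetic).
toy = pd.DataFrame({
    "Ultraviolet": red + rng.normal(0, 2, 500),
    "Green": red + rng.normal(0, 0.3, 500),
    "Red": red,
    "Near Infrared": red + rng.normal(0, 0.3, 500),
    "Infrared": red + rng.normal(0, 0.3, 500),
    "Redshift": rng.normal(0, 1, 500),
})

# Pearson correlation of every predictor with Red.
print(toy.corr()["Red"].round(2))
```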

In [ ]:
# In the pair plot, we observed that the green, near infrared, and infrared predictor variables have a relatively strong positive correlation with the red variable.
# Therefore, we perform a classification without the green, near infrared, and infrared as predictors to compare the accuracy to our previous model.

star_knn_2 = KNeighborsClassifier()

star_preprocessor_2 = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Red", "Redshift"])
)

star_pipeline_2 = make_pipeline(star_preprocessor_2, star_knn_2)

parameter_grid_2 ={
    "kneighborsclassifier__n_neighbors" : range(2,15,1),
}

star_tune_2 = GridSearchCV(
    star_pipeline_2,
    parameter_grid_2,
    cv=5,
    return_train_score=True,
    n_jobs=-1
)

star_model_2 = star_tune_2.fit(upsampled_star_train[["Ultraviolet", "Red", "Redshift"]], upsampled_star_train["Class"])

star_accuracy_2 = pd.DataFrame(star_model_2.cv_results_)
In [ ]:
# We create a plot of k values against estimated accuracy to choose our k value.
accuracy_plot_2 = alt.Chart(star_accuracy_2, title = "Accuracy estimates by nearest neighbours").mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Accuracy estimate").scale(zero=False)
)
accuracy_plot_2
# Our accuracy plot shows that we should use k = 2 in our classification model, as it is estimated to provide the highest accuracy.
Out[ ]:

Figure 4. Accuracy estimates from the model tuned on the training data, which are plotted against the number of neighbours in the KNN classification model.

From Figure 4, k = 2 is estimated to have the highest accuracy, which is the same result as our previous cross-validation.

In [ ]:
# We use the new tuned model to predict the class of each testing data observation in a new column in the testing data frame called "Prediction_2".
# The best neighbours value found two cells prior is stored in the star_tune_2 object.
star_test["Prediction_2"] = star_tune_2.predict(
    star_test[["Ultraviolet", "Red", "Redshift"]]
)
In [ ]:
# This cell outputs the accuracy of our model on the test data, which is 95.508%.
star_tune_2.score(
    star_test[["Ultraviolet", "Red", "Redshift"]],
    star_test["Class"]
)
Out[ ]:
0.95508
In [ ]:
# This cell outputs a confusion matrix for the test data, with each row representing the true class value, and each column representing the classes our model predicted.
pd.crosstab(star_test["Class"], star_test["Prediction_2"])
Out[ ]:
Prediction_2 GALAXY QSO STAR
Class
GALAXY 12339 174 318
QSO 194 2450 1
STAR 430 6 9088
In [ ]:
# Pair plot showing redshift versus brightness magnitude for each light band, with point colours corresponding to the object class.
columns_to_plot = ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]

star_test_sample = star_test.sample(n=1000)

pairplot_2 = alt.Chart(star_test_sample).mark_point().encode(
    alt.X(alt.repeat("row"), type="quantitative").scale(zero=False),
    alt.Y(alt.repeat("column"), type="quantitative").scale(zero=False),
    color=alt.Color("Prediction").title("Prediction")
).properties(
    width=200,
    height=200
).repeat(
    column=columns_to_plot,
    row=["Redshift"]
)
pairplot_2
Out[ ]:

Figure 5. All variables plotted against redshift. A random subset of 1,000 observations from the testing data set are plotted to ensure computational efficiency. In all plots, stars appear to have the lowest redshift values and often appear to have the second highest brightness values compared to the other classes. Galaxies appear to have the second lowest redshift values in general, and often the lowest brightness. Quasars (QSO) generally have the highest redshift and brightness values compared to the other classes. Quasars have much wider ranges for redshift values, but smaller ranges for brightness values compared to the other classes.

Discussion¶

Findings: We found that our first classification model, using all predictor variables, had an accuracy of 96.172%, meaning that about 96% of the time it correctly classified the test data objects. The accuracy of our second model was slightly lower at 95.508%, so it correctly classified the test data about 95% of the time. Both accuracy values are quite high, as our models were correct on the testing data more than 19 times out of 20. This also demonstrates that by eliminating three variables (green, near infrared, and infrared brightness) that have relatively strong correlations with another variable (red brightness), as shown in Figure 3, we can generate a model that performs almost as well as the original model that uses all the variables.
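The per-class misclassification rates discussed in this section can be computed directly from a confusion matrix by dividing each cell by its row total. A sketch using the first model's test-set counts copied from the cross-tabulation above:

```python
import pandas as pd

# Confusion matrix counts from our first model's test-set cross-tabulation
# (rows = true class, columns = predicted class).
confusion = pd.DataFrame(
    [[12451, 75, 305], [124, 2521, 0], [440, 13, 9071]],
    index=["GALAXY", "QSO", "STAR"],
    columns=["GALAXY", "QSO", "STAR"],
)

# Share of each true class assigned to each predicted class.
rates = confusion.div(confusion.sum(axis=1), axis=0)

# E.g. the fraction of true galaxies misclassified as stars (~2.4%),
# and of true quasars misclassified as galaxies (~4.7%).
print(rates.loc["GALAXY", "STAR"].round(3))
print(rates.loc["QSO", "GALAXY"].round(3))
```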

Expectations:

  • Starting this project, we expected to be able to classify astronomical objects as stars, galaxies, or quasars based on their redshift and light emissions in five wavelengths. Our results show that we can achieve similar classification accuracy using redshift and only two wavelengths of light, red and ultraviolet, as opposed to all five.
  • In our exploratory data analysis, the distributions of stars and galaxies are fairly similar in the light magnitude variables. Therefore, we considered that our models might have trouble distinguishing between galaxies and stars. However, only 3% of galaxies in both models were incorrectly classified as stars, which did not match our expectation.
  • We also expected objects with high brightness and large redshift to be classified as quasars. In Figure 5, many of the points in the top middle and right areas of the graph are orange, indicating they were classified as quasars, which matched our expectation. However, our confusion matrices showed that 4% and 7% of quasars in our first and second models were classified as galaxies. Perhaps these objects were particularly dim quasars with low redshift that resembled galaxies in the data, suggesting an area our models could improve.
  • Figure 5 shows that objects we classified as stars generally have the lowest redshift values but can extend to the highest brightness values of all objects. Finding this trend fulfills our expectation of determining additional patterns between the object’s class and its redshift and brightness values.

Impacts of findings:

  • Our models could potentially be used to categorize astronomical objects based on real brightness and redshift data. Our first model would likely provide slightly higher accuracy given its better accuracy score on test data, while our second model only uses three predictor variables instead of six, so it could be less computationally intensive. This classification could allow astronomers to find new properties present in the different object classes to expand our astronomical knowledge on stars, quasars, and galaxies.
  • These models could also be used to observe how quasars, stars, and galaxies are distributed in the night sky. This could improve our knowledge on distributional and clustering patterns of these objects.
  • After classification with our models, researchers could also examine the objects’ properties at different redshifts. Given that more distant objects tend to have higher redshifts, perhaps more distant quasars, stars, and galaxies show unique characteristics compared to closer ones (Knox, 2023).

Future Questions:

  • While our models can distinguish between galaxies, stars, and quasars, we could pursue further subcategorization. We might ask: What type of galaxy is this? Perhaps this could reveal that a certain galaxy type has particularly high brightness, which might include the galaxies in Figure 5 with exceptionally high brightness values.
  • We could explore how the magnitudes of different light bands correlate with physical attributes like size. After classifying objects with our models, one could investigate the sizes of these objects to see if relationships between light magnitude and size exist.
  • To improve our models’ accuracies, another question arises: What are additional helpful predictor variables we could introduce? For example, we might consider variables like the chemical composition of these objects (Center for Astrophysics, n.d.). If this variable was different between objects, it might provide distinct signals within light detectable by a classifier to raise model accuracy.

Data set attribution¶

"Sloan Digital Sky Survey DR16" by Mukharbek Organokov, used under CC BY-SA 4.0. Link to data set: https://www.kaggle.com/datasets/muhakabartay/sloan-digital-sky-survey-dr16. Link to license: https://creativecommons.org/licenses/by-sa/4.0/. We acknowledge that the original data is from the Sloan Digital Sky Survey Data Release 16.

References¶

  • SDSS. (2022a). Camera. https://www.sdss4.org/instruments/camera/#Filters
  • Center for Astrophysics. (n.d.) Spectroscopy. https://www.cfa.harvard.edu/research/topic/spectroscopy
  • Clarke, A. O., Scaife, A. M. M., Greenhalgh, R., & Griguta, V. (2020, July 13). Identifying galaxies, quasars, and stars with Machine Learning: A new catalogue of classifications for 111 million SDSS sources without spectra. Astronomy & Astrophysics. https://www.aanda.org/articles/aa/full_html/2020/07/aa36770-19/aa36770-19.html#R16
  • Cooper, K. (2018, February 24). Quasars: Everything you need to know about the brightest objects in the universe. Space.com. https://www.space.com/17262-quasar-definition.html
  • Crockett, C. (2021, January 24). What do redshifts tell astronomers? https://earthsky.org/astronomy-essentials/what-is-a-redshift/
  • SDSS. (2022b). Data release 17. https://www.sdss4.org/dr17/
  • Fedesoriano. (2022, January 15). Stellar classification dataset - SDSS17. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data
  • Fukugita, M., Ichikawa, Y., Gunn, J. E., Doi, M., Shimasaku, K., & Schneider, D. P. (1996). The Sloan Digital Sky Survey Photometric System. The Astronomical Journal, 111(4), 1748-1756. https://doi.org/10.1086/117915
  • Knox, L. (2023, January 5). 6: Redshifts. Physics LibreTexts. https://phys.libretexts.org/Courses/University_of_California_Davis/UCD%3A_Physics_156_-_A_Cosmology_Workbook/Workbook/06._Redshifts_(INCOMPLETE)
  • NASA. (n.d.-a). Galaxies. NASA. https://science.nasa.gov/universe/galaxies/
  • NASA. (n.d.-b). Stars. NASA. https://science.nasa.gov/universe/stars/
  • Pandas: How to read CSV file from Google Drive public? (2019). Stack Overflow. Retrieved March 9, 2024, at https://stackoverflow.com/questions/56611698/pandas-how-to-read-csv-file-from-google-drive-public
  • SDSS Voyages. (2024a). Photometric information. https://voyages.sdss.org/help/explore-main-window/photometric-information/
  • Sloan Digital Sky Survey. (n.d.a). Galaxies. Retrieved March 9, 2024, from https://skyserver.sdss.org/dr1/en/astro/galaxies/galaxies.asp
  • Sloan Digital Sky Survey. (n.d.b). Redshifts. Retrieved March 9, 2024, from https://skyserver.sdss.org/dr16/en/proj/advanced/hubble/redshifts.aspx